Two of the most important data structures in pandas are the Series and the DataFrame.

Series

A pandas Series is essentially an array with an associated index. In other words, it is an associative array, or in Python terms, a fixed-length, ordered dictionary.


In [3]:
from pandas import Series
Series([1,2,3,4,5])
# Note the index on the left hand side when printed


Out[3]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]:
a = Series([1, 2, 3, 4, 5])
a.values, a.index # Notice how the values of a Series are just an array, while the index is a special Index type


Out[4]:
(array([1, 2, 3, 4, 5]), RangeIndex(start=0, stop=5, step=1))

In [24]:
# By default a Series doesn't have a name, but you can assign it one:
Series([1,2,3,4], name="My Series")


Out[24]:
0    1
1    2
2    3
3    4
Name: My Series, dtype: int64

In [5]:
# You can change the index if you want
indexed_series = Series([1,2,3,4,5], index=['a','b','c','d','e'])
indexed_series


Out[5]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

In [6]:
# You can use the assigned index or a numeric position to access elements
indexed_series['c'], indexed_series[0]
# In other words, a Series is really just an ordered dict


Out[6]:
(3, 1)
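
If you want to be explicit about the kind of lookup, pandas also provides .loc for label-based access and .iloc for position-based access. A minimal sketch (not run here):


In [ ]:
# Label-based vs. position-based access, made explicit
indexed_series.loc['c'], indexed_series.iloc[0]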

In [67]:
# In fact, you can create a Series right from a Python dict
myseries = Series({'Netherlands': 5, 'Belgium': 10, 'France': 21})
myseries


Out[67]:
Belgium        10
France         21
Netherlands     5
dtype: int64

In [68]:
# You can access it by position or by index; note that Belgium comes first because the dictionary keys are sorted
myseries[0], myseries['Netherlands']


Out[68]:
(10, 5)

As with regular numpy arrays, you can create boolean masks on a Series and use them to select certain values, like so:


In [69]:
myseries[myseries % 2 == 1] # keep only the odd values


Out[69]:
France         21
Netherlands     5
dtype: int64
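
Masks can also be combined with & (and) and | (or); each comparison needs its own parentheses. A quick sketch:


In [ ]:
# Keep values that are odd AND greater than 10
myseries[(myseries % 2 == 1) & (myseries > 10)]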

In [71]:
# Sorting by value:
myseries.sort_values()


Out[71]:
Netherlands     5
Belgium        10
France         21
dtype: int64
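
The counterpart of sort_values() is sort_index(), which sorts by the index (the keys) instead. A quick sketch:


In [ ]:
# Sorting by index instead of by value:
myseries.sort_index()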

A useful feature of Series is that operations on them are often smart enough to align values by index. For example, when adding two Series together, the values at matching indices (=keys) are automatically summed, like so:


In [13]:
a_dict = {'Netherlands': 5, 'Belgium': 10, 'France': 21, 'UK': 5}
b_dict = {'Netherlands': 2, 'Belgium': 1, 'France': -9, 'USA': 4}
a = Series(a_dict)
b = Series(b_dict)
a+b


Out[13]:
Belgium        11.0
France         12.0
Netherlands     7.0
UK              NaN
USA             NaN
dtype: float64

This is not supported by plain Python dicts:


In [14]:
a_dict + b_dict # Doesn't work, TypeError


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-0a1691d054e4> in <module>()
----> 1 a_dict + b_dict # Doesn't work, TypeError

TypeError: unsupported operand type(s) for +: 'dict' and 'dict'
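
If you would rather treat a key that is missing from one of the two Series as 0 instead of ending up with NaN, the add() method accepts a fill_value argument. A minimal sketch using the same a and b:


In [ ]:
# Same addition, but missing keys count as 0 instead of producing NaN
a.add(b, fill_value=0)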

DataFrame

Where a Series is like a single associative array, a DataFrame is a rectangular table of data that contains an ordered collection of columns (each of which is a Series).


In [1]:
# python list of dictionaries that all have the same keys
customers = [
    {'name': "John", 'last_name': "Smith", 'age': 43, 'customer_nr': 12345 },
    {'name': "Mary", 'last_name': "Potter", 'age': 25, 'customer_nr': 67889 },
    {'name': "Rose", 'last_name': "Harrison", 'age': 39, 'customer_nr': 23456 },
    {'name': "John", 'last_name': "Ford", 'age': 56, 'customer_nr': 99999 },  
    {'name': "Patrick", 'last_name': "Harrison", 'age': 41, 'customer_nr': 7777 },
]
# print customers

In [2]:
# let's create a dataframe from this
import pandas as pd
customers_df = pd.DataFrame(customers)
print customers_df # As you'll notice, pandas does pretty printing!


   age  customer_nr last_name     name
0   43        12345     Smith     John
1   25        67889    Potter     Mary
2   39        23456  Harrison     Rose
3   56        99999      Ford     John
4   41         7777  Harrison  Patrick

In [15]:
import numpy as np
# customers_df.select(lambda x: x.name == "John")
customers_df[customers_df['name'] == "John"]


Out[15]:
   age  customer_nr last_name  name
0   43        12345     Smith  John
3   56        99999      Ford  John

In [18]:
# Alternative way to define the same thing
customers = {
    'name': ["John", "Mary", "Rose", "John", "Patrick"],
    'last_name': ["Smith", "Potter", "Harrison", "Ford", "Harrison"],
    'age': [43,25,39,56,41],
    'customer_nr': [12345, 67889, 23456,99999,7777]
}
customers_df = pd.DataFrame(customers)
print customers_df


   age  customer_nr last_name     name
0   43        12345     Smith     John
1   25        67889    Potter     Mary
2   39        23456  Harrison     Rose
3   56        99999      Ford     John
4   41         7777  Harrison  Patrick

In [47]:
# By default, rows are labeled by numbers, but you can also give them labels, like so
population = {'Nevada': {2001: 2.4, 2002: 2.9},
                 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pd.DataFrame(population)


Out[47]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
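
You can also pass the row labels explicitly with the index argument when constructing a DataFrame. A sketch, reusing the customers dict from above (the labels here are made up):


In [ ]:
# Explicit row labels via the index argument
pd.DataFrame(customers, index=['c1', 'c2', 'c3', 'c4', 'c5'])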

DataFrame manipulation

Pandas has a bunch of methods on the DataFrame that allow you to easily manipulate the data.


In [33]:
# Accessing a certain column (a Series gets returned), either by key or by attribute name
print customers_df['name'], "\n\n", customers_df.name


0       John
1       Mary
2       Rose
3       John
4    Patrick
Name: name, dtype: object 

0       John
1       Mary
2       Rose
3       John
4    Patrick
Name: name, dtype: object

In [36]:
# To access a certain row, use df.loc:
customers_df.loc[2] # Note that rows are also Series (you can tell from the output because it has a Name and dtype)


Out[36]:
age                  39
customer_nr       23456
last_name      Harrison
name               Rose
Name: 2, dtype: object
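
.loc also accepts a column label, so you can pick out a single cell; .iloc does the same selection by position. A quick sketch:


In [ ]:
# A single cell: row label 2, column 'name'
customers_df.loc[2, 'name']
# The same row selected by position instead of by label:
customers_df.iloc[2]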

In [72]:
# Transpose:
customers_df.T


Out[72]:
                                      0                           1                           2                           3                           4
age                                  43                          25                          39                          56                          41
customer_nr                        1234                        1234                        1234                        1234                        1234
last_name                         Smith                      Potter                    Harrison                        Ford                    Harrison
name                               John                        Mary                        Rose                        John                     Patrick
signup_date  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248  2017-05-25 21:27:58.376248

In [74]:
# Sort by a certain column:
customers_df.sort_values(by='age') # You can also sort by multiple columns by passing an array


Out[74]:
   age  customer_nr last_name     name                signup_date
1   25         1234    Potter     Mary 2017-05-25 21:27:58.376248
2   39         1234  Harrison     Rose 2017-05-25 21:27:58.376248
4   41         1234  Harrison  Patrick 2017-05-25 21:27:58.376248
0   43         1234     Smith     John 2017-05-25 21:27:58.376248
3   56         1234      Ford     John 2017-05-25 21:27:58.376248
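
Sorting by multiple columns, as mentioned above, is just a matter of passing a list of column names. A quick sketch:


In [ ]:
# Sort by last name first, then by age within equal last names
customers_df.sort_values(by=['last_name', 'age'])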

In [20]:
# Count occurrences
customers_df['name'].value_counts() # there are 2 johns!


Out[20]:
John       2
Mary       1
Patrick    1
Rose       1
Name: name, dtype: int64

In [78]:
# sum() sums the columns that have a dtype that can be summed.
# describe() provides summary statistics for the columns with a suitable dtype.
# There are a lot more of these convenience functions, like mean, median, prod, std, var, min/max, etc.
print customers_df.sum(), "\n\n", customers_df.describe()


age             204
customer_nr    6170
dtype: int64 

             age  customer_nr
count   5.000000          5.0
mean   40.800000       1234.0
std    11.054411          0.0
min    25.000000       1234.0
25%    39.000000       1234.0
50%    41.000000       1234.0
75%    43.000000       1234.0
max    56.000000       1234.0
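
A couple of the other convenience functions mentioned above, applied to a single column. A quick sketch:


In [ ]:
# A few more aggregations, on the age column
customers_df['age'].mean(), customers_df['age'].median(), customers_df['age'].max()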

In [40]:
# Add a new column and assign a value to an existing column
import datetime
customers_df['signup_date'] = datetime.datetime.now()
customers_df['customer_nr'] = 1234 # Set all values to 1234; you can also pass an array to specify each value (see the sketch below)
customers_df


Out[40]:
   age  customer_nr last_name     name                signup_date
0   43         1234     Smith     John 2017-05-25 21:27:58.376248
1   25         1234    Potter     Mary 2017-05-25 21:27:58.376248
2   39         1234  Harrison     Rose 2017-05-25 21:27:58.376248
3   56         1234      Ford     John 2017-05-25 21:27:58.376248
4   41         1234  Harrison  Patrick 2017-05-25 21:27:58.376248
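
Passing an array (or list) instead of a scalar assigns one value per row; it must have as many elements as the DataFrame has rows. A sketch that restores the original customer numbers:


In [ ]:
# One value per row instead of a single scalar for the whole column
customers_df['customer_nr'] = [12345, 67889, 23456, 99999, 7777]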

In [41]:
# Boolean masks: select all customers that are older than 40
# Note the use of the > comparison to build the mask
customers_df[customers_df['age'] > 40]


Out[41]:
   age  customer_nr last_name     name                signup_date
0   43         1234     Smith     John 2017-05-25 21:27:58.376248
3   56         1234      Ford     John 2017-05-25 21:27:58.376248
4   41         1234  Harrison  Patrick 2017-05-25 21:27:58.376248

In [42]:
# You can plot counts (the %matplotlib inline magic below renders plots in the notebook)
%matplotlib inline 
name_counts = customers_df['name'].value_counts()
name_counts.plot(kind='barh') # barh = bar horizontal


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x115f5c390>

DataFrames and functions


In [54]:
# Let's define a dataframe
numbers = pd.DataFrame({
    'elevation': [123, -23, 456],
    'signal_strength': [-783, 123, 453]
})
numbers


Out[54]:
   elevation  signal_strength
0        123             -783
1        -23              123
2        456              453

In [53]:
# Since a DataFrame is essentially a collection of Series, you can apply numpy functions to it, like so:
import numpy as np
np.abs(numbers)


Out[53]:
   elevation  signal_strength
0        123              783
1         23              123
2        456              453

In [57]:
# Apply a custom function column by column (one Series at a time)
numbers.apply(lambda col_series: col_series + 100)


Out[57]:
   elevation  signal_strength
0        223             -683
1         77              223
2        556              553
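
By default apply() works one column at a time; passing axis=1 applies the function one row at a time instead. A quick sketch:


In [ ]:
# Apply a function row by row instead of column by column
numbers.apply(lambda row_series: row_series.sum(), axis=1)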

In [79]:
# Apply a custom function one element at a time
numbers.applymap(lambda val: 0 if val < 100 else val)


Out[79]:
elevation signal_strength
0 123 0
1 0 123
2 456 453